Introduction to Triton Programming: From Eager Execution to Block-Based Processing

Moving from PyTorch eager mode to Triton requires a change in perspective: instead of treating a tensor as a single monolithic object, you treat it as a collection of discrete, manageable blocks (tiles).

1. PyTorch Tensors vs. Triton Tensors

It is essential to distinguish a Triton tensor from a PyTorch tensor. A PyTorch tensor is a host-side Python object that wraps shape, dtype, device, stride, and storage metadata. Triton, by contrast, works with raw data pointers into specific blocks of memory, which is what makes aggressive low-level optimization possible.
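To make the distinction concrete, here is a minimal plain-Python sketch (neither PyTorch nor Triton code) of the address arithmetic that stride metadata encodes. The helper names `contiguous_strides` and `flat_offset` and the 3x4 shape are illustrative assumptions, not library APIs. The point is that a kernel sees only a base pointer plus element offsets like the one computed here.

```python
# Plain-Python illustration; no PyTorch or Triton required.
# Both helpers are hypothetical and exist only for this example.

def contiguous_strides(shape):
    """Row-major strides, in elements, for a contiguous tensor of `shape`."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_offset(index, strides):
    """Element offset from the base pointer for a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (3, 4)
strides = contiguous_strides(shape)      # (4, 1) for a 3x4 row-major layout
offset = flat_offset((2, 1), strides)    # 2*4 + 1*1
print(strides, offset)
```

On the host side, the tensor object carries this bookkeeping for you. Inside a kernel, the offsets have to be computed explicitly from the pointer and the strides passed in as arguments.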

2. The Eager Execution Bottleneck

In standard eager execution, every operation (for example, an addition followed by a ReLU) requires its own kernel launch and a round trip through global memory. This is a primary bottleneck in modern GPU computing. Triton addresses it by fusing the operations into a single kernel that processes blocks of data (for example, 128, 256, or 512 elements) directly in on-chip memory.
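The cost of the extra round trip can be estimated with a rough back-of-the-envelope model. The sketch below is an assumption-laden simplification: it counts only element-sized global-memory accesses, ignores caches, and uses illustrative function names that do not come from PyTorch or Triton.

```python
# Simplified traffic model for out = relu(x + y) over n elements.
# Assumptions: no caching; each kernel reads its inputs from global memory
# and writes its outputs back to global memory.

def eager_add_relu_traffic(n):
    """Two kernels: tmp = x + y, then out = relu(tmp)."""
    add_kernel = 2 * n + n     # read x and y, write tmp
    relu_kernel = n + n        # read tmp, write out
    return {"launches": 2, "accesses": add_kernel + relu_kernel}


def fused_add_relu_traffic(n):
    """One kernel: the intermediate sum never leaves on-chip memory."""
    return {"launches": 1, "accesses": 2 * n + n}  # read x and y, write out


n = 1_000_000
print("eager:", eager_add_relu_traffic(n))
print("fused:", fused_add_relu_traffic(n))
```

Under this model the fused version performs 3n global-memory accesses instead of 5n and needs one launch instead of two. Real savings depend on the hardware and on cache behavior.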

3. The Block-Based Programming Model

Instead of reasoning at the scalar level of individual CUDA threads, Triton uses SPMD (Single Program, Multiple Data) at the block level. You write a single kernel, and Triton launches many instances of it across a grid. Each instance uses program_id to compute which block of memory it owns.
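The ownership rule can be emulated in plain Python to see how it behaves when the data does not divide evenly into blocks. This is a sketch, not Triton code: the function names and the small sizes (10 elements, blocks of 4) are assumptions chosen for readability. In an actual kernel the same quantities typically come from `tl.program_id(0)`, `tl.arange(0, BLOCK_SIZE)`, and a mask such as `offsets < n`.

```python
# Plain-Python emulation of the SPMD block model; runs without a GPU.

def num_programs(n, block_size):
    """Grid size: ceiling division, so a partial last block is still covered."""
    return (n + block_size - 1) // block_size


def block_offsets(pid, n, block_size):
    """In-bounds element indices owned by program instance `pid`."""
    start = pid * block_size
    offsets = range(start, start + block_size)
    # The `if` plays the role of the mask that guards out-of-bounds accesses.
    return [i for i in offsets if i < n]


n, block_size = 10, 4
for pid in range(num_programs(n, block_size)):
    owned = block_offsets(pid, n, block_size)
    print(f"pid {pid}: [{owned[0]}, {owned[-1] + 1})")
```

Every instance runs the same code and differs only in its `pid`. The last instance owns a shorter range, which is why the bounds check matters.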

4. Environment Setup

To get started, install Triton in a clean environment (created with conda or venv) so that its dependencies do not conflict with any existing CUDA toolkits: pip install triton.
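After installing, a quick sanity check can confirm that the active environment actually sees the packages. The sketch below uses only the standard library. The helper name `is_installed` is an assumption for illustration, and checking for `torch` alongside `triton` is a convenience, not a requirement stated above.

```python
# Checks whether packages are discoverable without importing them,
# so the script itself runs even when Triton is missing.
import importlib.util


def is_installed(package):
    """True if `package` can be found in the current environment."""
    return importlib.util.find_spec(package) is not None


for name in ("triton", "torch"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package shows as missing, the interpreter running the script is probably not the one from the environment where pip installed it.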

QUESTION 1

What is the primary difference between a PyTorch tensor and a Triton tensor within a kernel?

Triton tensors contain Python metadata like strides; PyTorch tensors are raw pointers.

A PyTorch tensor is a host-side object wrapping metadata; a Triton tensor represents blocks of data processed at the compiler level.

There is no difference; they are the same object.

Triton tensors are stored on the CPU, while PyTorch tensors are on the GPU.

QUESTION 2

Why is 'Eager Mode' considered a bottleneck for modern GPU performance?

Because it uses too much CPU memory.

Every operation requires a separate kernel launch and a global memory round-trip.

It cannot handle floating-point numbers.

It lacks support for the Python language.

QUESTION 3

What is the result of installing Triton in a 'dirty' environment with conflicting CUDA toolkits?

Triton will automatically fix the CUDA path.

It may lead to library version mismatches and kernel compilation errors.

The GPU will run faster due to multiple toolkit options.

Triton does not use CUDA, so there is no conflict.

QUESTION 4

Which mapping from pid to index range is correct for N=1000, BLOCK_SIZE=256?

pid 0: [0, 256); pid 1: [256, 512); pid 2: [512, 768); pid 3: [768, 1000)

pid 0: [0, 1000)

pid 0: [0, 256); pid 1: [257, 512); pid 2: [513, 768); pid 3: [769, 1000)

pid 1: [0, 256); pid 2: [256, 512); pid 3: [512, 768); pid 4: [768, 1000)

QUESTION 5

In block-based parallelism, the unit of work shifts from 'compute one element' to:

'Compute one entire tensor'.

'Compute one block of 128/256/512 elements'.

'Compute one scalar at a time'.

'Let the CPU handle the math'.